Abstract —Motivated by the center-surround mechanism in
the human visual attention system, we propose to use average 
contrast maps for the challenge of pedestrian detection in street 
scenes due to the observation that pedestrians indeed exhibit 
discriminative contrast texture. Our main contributions are as
follows. First, we design a local, statistical multi-channel
descriptor that incorporates both color and gradient information.
Second, we introduce a multi-direction and multi-scale contrast
scheme based on grid cells in order to integrate expressive local
variations. Third, to select the most discriminative features for
classification, we perform extensive comparisons w.r.t. statistical
descriptors, contrast measurements, and scale structures, and
obtain reasonable results under various
configurations. Empirical findings from applying our optimized
detector on the INRIA and Caltech pedestrian datasets show 
that our features yield state-of-the-art performance in pedestrian 
detection. 

Keywords — center-surround contrast, human vision, channels,
multi-direction, multi-scale, pedestrian detection. 

I. Introduction 

THE problem of pedestrian detection is attracting growing
attention in the computer vision community as it has many 
practical applications in areas such as video surveillance or 
driving assistance systems. There is a quickly growing body 
of work on accurate and efficient detection of pedestrians in 
image or video data (see [1] for a recent survey). Contributions 
have been made regarding problems such as feature extraction, 
classifier design, occlusion handling and the like. Although 
there were significant improvements over the last decade, 
one must acknowledge that the precision of state-of-the-art 
pedestrian detectors still lags behind human vision, which 
is capable of rapidly localizing pedestrians under various 
scales, poses, and occlusions even in low quality images. 
This motivates us to analyze how the human vision system 
processes incoming stimuli and to devise corresponding novel 
features for pedestrian detection. In this paper, we present 
experimental results which show that the use of biologically 
inspired mechanisms can indeed aid recognition. 

In the human visual system, processing of information begins
in the retinal tissue immediately after photoreceptive cells
(rods for lightness and cones for colors, respectively) have
transformed incident light into electric signals (cf. Fig. 1). In a
first layer of bipolar cells, electrical membrane potentials are
locally aggregated. Grouped bipolar cells report to different types
of ganglion cells, which convert analog potentials into electric
pulse rates. At the transitional synapses between photoreceptive
and bipolar cells, but also from bipolar to ganglion cells,
there is a lateral wiring of so-called horizontal and amacrine
cells, respectively, which modulate the signals to enhance contrasts
in a center-surround fashion. It was found that the output of certain
ganglion cells agrees with simple difference of Gaussian (DoG)
filter responses [3] or more complex oriented Gabor filter
results [4]. A more in-depth survey on retinal cell types and
their wiring can, for instance, be found in [5].

Fig. 1: Human retinal tissue: (a) Schematic cross-section (image
adapted from [2]); (b) Spatial wiring and DoG weighting
of retinal ganglion cells.

The center-surround mechanism is also found in later processing
stages in the brain where it guides human attention
and thus affects how people recognize objects of interest. This
psychophysical theory has been widely used in computational
approaches to generate saliency maps of the environment [6].
However, while attention is about bottom-up, model-free
analysis of signals from the environment, visual
search for specific entities requires top-down saliency which
tunes the scoring of basic features to the expected appearance.

In this article, we propose to emulate center-surround contrast
features motivated by the human visual system and to
tune them towards characterizations of the appearance of
pedestrians. Our previous findings about human vision driven
features were published in [7]; in this paper, we explore the
configurations of feature design and achieve better performance.
Our contributions are summarized as follows:

Statistical multi-channel cell descriptors: We collect
multi-channel information for each cell area, i.e. local image
patch, not only regarding lightness and colors, but also w.r.t.
gradients, which complement each other in recognizing broad
variations of clothing or articulations of the human body. In
order to summarize the underlying, unknown distribution of
each cell’s channel values, we propose two kinds of distributions:
(1) a continuous Gaussian distribution, which is the maximum
entropy model given a known mean and variance [8]; (2)
a bilinearly interpolated histogram, which is a representation of
frequencies observed over discrete intervals (bins).

Multi-direction and -scale contrast vectors: Aiming at 
incorporating more specific information between central and 
surrounding cells, we treat adjacent image regions in different 
directions individually rather than as a single surrounding 
region and thus obtain multi-direction contrast descriptors; we 
compute statistical features at different cell-sizes so as to build 
a contrast pyramid which accords with the general architecture 
of most visual saliency systems. 

Extensive evaluations on various configurations: In order
to determine the strongest feature scheme for pedestrian
detection, we implement various contrast measurements for
both distributions and at different scale structures. In extensive
evaluations on the INRIA dataset, we find that the advisable
scheme is to use a Gaussian-$W_2$ combination and a 4-6-8-10
scale structure.

Our presentation proceeds as follows: Section II provides an
overview of related work; Section III presents details of our
feature extraction mechanism. Two key components of this
mechanism, namely statistical descriptors and contrast measures,
are explained in Section IV and Section V, respectively.
Our classification procedure is presented in Section VI, followed
by a discussion of thorough and extensive experiments
in Section VII where we compare different feature schemes
and state-of-the-art detectors on standard benchmarks. Finally,
we conclude and propose several directions for future work in
Section VIII.




Fig. 2: Heat maps of average center-surround contrasts generated
from positive and negative samples of the INRIA pedestrian
dataset. Warmer colors indicate higher contrast values.


II. Related work 

Since our focus in this paper is on emulating the center- 
surround mechanism in human vision in order to design 
new contrast features for pedestrian detection, the following 
literature review mainly considers difference based features 
for pedestrians and center-surround contrast measures used by 
computational visual attention approaches. 

A. Features for pedestrian detection 

Most features for pedestrian detection interpret local or
global pixel differences in various forms. This is because
pixel differences represent texture information, which is often
characteristic of classes of objects and thus allows for robust
classification.

Gradients (vectors of directed derivatives) are popular features
as they describe differences w.r.t. intensity or colors
between neighboring pixels and allow for characterizing these
in terms of magnitudes and orientations. Arguably the most
popular kind of feature for pedestrian detection, Histograms of
Oriented Gradients (HOGs) [9], is indeed built on gradient
statistics. HOG features brought about significant improvements
and therefore establish an important baseline. Several
researchers have extended this feature pool and added further
features. For example, Liu et al. [10] introduced the idea of
a granularity space, i.e. a family of descriptors ranging from
edgelets to HOGs.

Local Binary Pattern (LBP) features [11] are another kind 
of pixel-wise difference based features which express relative 
intensity relationships between neighboring pixels by binary 
codes. Wang et al. [12] combined LBP features with HOG
features in order to better cope with occlusions; Ma et al. [13]
proposed a set of edge orientation histogram (EOH) and 
oriented LBP based features to describe cell-level and block- 
level structure information. 

Haar-like features [14], on the other hand, are considered as
patch-wise local differences as they compute sums of intensity
values over rectangular image regions. Zhang et al. [15]
designed Haar-like templates tailored to the upright human body
and achieved significant improvements.

Color Self Similarity (CSS) features proposed by Walk
et al. [16] describe global differences between pairs of image
cells in terms of color histograms. Significant improvements
were achieved by combining HOG features and CSS features,
since they allow for representing uniform textures found in
people’s clothing.

Fig. 3: Flow chart of our feature extraction mechanism. Here, we
consider a three-scale structure as an example but different
scale structures can be used as well.

Although extensive efforts have been made to interpret local
differences in various ways, the performance of all the above
features is still far behind human capabilities. We therefore
argue that it is worthwhile to look into how the human
brain processes visual inputs and then to mimic the corresponding
mechanisms in order to design more representative features.

To our knowledge, the first attempt at designing human
vision inspired features dedicated to pedestrian detection can
be found in [17]. Although our motivation in this article is
similar, our features differ considerably from the ones in
[17]. The following clear distinctions can be drawn: 1) we
consider local differences between central and surrounding
square regions rather than between pixels; 2) we compute the
center-surround contrasts in multiple channels (not only on
colors but also on gradient information); 3) we do not use
image channel values directly but describe their distributions
using statistical entities; 4) we do not treat neighboring regions
as a whole but individually and thus incorporate more detailed
information regarding local differences.

B. Center-surround contrast measures 

Most computational approaches to visual attention determine
center-surround contrasts by DoG-filters or approximations
of these [18]. Recently, several researchers represented
the central and surrounding areas in terms of feature
distributions so as to capture more information about the
areas. These distributions were either discrete, e.g. in the form
of histograms [19], or continuous, e.g. fitted to a normal
distribution [20], and various distance measures can be applied
between central and surrounding distributions to quantify local
contrast.

However, we notice that the above strategies only achieve
reasonable results for rather conspicuous scenarios, e.g. a
big red flower standing out among surrounding green leaves. In
fact, the background in our case is much more complex and
the previous contrast models are not guaranteed to perform
well. Consequently, we train and evaluate specialized contrast
schemes in this paper and aim to find the optimal configuration
for our application.


III. Overview on feature extraction 

This section introduces the feature extraction procedure 
we consider in this paper. First of all, we demonstrate that 
center-surround contrasts are discriminative for pedestrians. 
We collect 2,416 positive samples (containing pedestrians)
and 5,000 negative samples (containing no pedestrians) from
the INRIA pedestrian dataset [9]. All the sample color images
are converted to gray images since considering only intensity
is sufficient to showcase the possible performance gain. The
contrast map of each image is computed on two scales of
4x4 and 6x6 pixels. The contrast value, represented by the
difference between central and surrounding cell regions w.r.t.
mean value, is added to the central region. Finally, two average
contrast maps are generated for the pedestrian class and the
non-pedestrian class, respectively. In Fig. 2, we see that the average
contrast map for pedestrians indeed resembles a human body
while the average contrast map for non-pedestrians shows no
distinct pattern.
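
The averaging experiment above can be illustrated with a minimal sketch for a single cell size. It is not the code used for Fig. 2: the function names are hypothetical, the surround is assumed to be the (up to) eight neighboring cells, contrast is assumed to be the absolute difference of mean intensities, and border cells are assumed to use only the neighbors that lie inside the image.

```python
def cell_means(img, size):
    """Mean intensity of each size x size cell of a 2D list (row-major)."""
    rows, cols = len(img) // size, len(img[0]) // size
    return [[sum(img[r * size + y][c * size + x]
                 for y in range(size) for x in range(size)) / size ** 2
             for c in range(cols)] for r in range(rows)]


def contrast_map(img, size):
    """Per-cell contrast: |center mean - average of the surrounding cell means|."""
    m = cell_means(img, size)
    rows, cols = len(m), len(m[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Assumption: only neighbours inside the image are used.
            nb = [m[rr][cc]
                  for rr in range(max(r - 1, 0), min(r + 2, rows))
                  for cc in range(max(c - 1, 0), min(c + 2, cols))
                  if (rr, cc) != (r, c)]
            if nb:
                out[r][c] = abs(m[r][c] - sum(nb) / len(nb))
    return out


def average_contrast_map(images, size):
    """Element-wise average of the contrast maps of equally sized images."""
    maps = [contrast_map(img, size) for img in images]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(mp[r][c] for mp in maps) / len(maps) for c in range(cols)]
            for r in range(rows)]
```

Calling `average_contrast_map` once on the positive and once on the negative samples would give one map per class at the chosen cell size.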

Based on the above observation, we design our center-surround
contrast features for pedestrian detection. An illustration
of our feature extraction procedure is shown in Fig. 3.
First, we compute multiple channels (e.g. color and gradient
information) for each pixel in an image; second, we divide
each channel map into square cells of a fixed size and describe
each cell using statistical distributions; third, we compute the
differences between each cell and its eight nearest neighboring
cells so as to obtain a multi-direction contrast vector; finally,
we repeat the second and third steps along each channel with
different cell sizes and thus obtain a multi-channel,
multi-direction, and multi-scale contrast pyramid for the whole
image.

























































































Fig. 4: Illustration of a histogram of oriented gradients obtained
for a single pixel.


A. Center-surround contrasts 

The core part of our feature extraction is how to determine
the difference between two rectangular regions. To address
this problem, we first choose an appropriate distribution for
each region (see Section IV) and then consider
corresponding contrast measures to numerically describe the
difference between two given distributions (see Section V).
In order to determine the strongest center-surround contrast
features for pedestrian detection, we conduct extensive experiments
and comprehensive comparisons on various combinations
of distributions and contrast measures. Experimental
results under different schemes are presented in Section VII.

B. Channels 

Our use of multiple feature channels is motivated
by the success of Dollár’s detector [ChnFtrs] [21], which
has been established as a strong baseline due to its accuracy and
efficiency. In [ChnFtrs], multiple channel maps, in terms of
colors and gradients, are computed for each input image, and
the final feature values consist of local sums at different spatial
locations and over all channel maps. These local sums can be
computed efficiently with integral images and are
less sensitive to noise than the individual channel values.

Similar to [ChnFtrs], we also consider a total of 10 different
channels: 3 channels for LUV colors, 1 channel for
gradient magnitude information, and 6 channels for histograms 
of oriented gradients. Note that all the above channels are 
computed pixel by pixel. Histograms of oriented gradients are 
usually computed for a group of pixels inside an image region, 
but we compute them pixelwise which is to say we simply 
quantize the gradient magnitudes into orientation bins. For 
each pixel, two neighboring bins are affected as we employ 
bilinear interpolation w.r.t. orientation bins, see Fig. 4 for an 
illustration. 
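
The pixelwise orientation binning can be sketched as follows. This is only an illustration under stated assumptions: the function name `orientation_channels` is hypothetical, orientations are assumed to be unsigned (in [0, pi)), bin centers are assumed to lie in the middle of each bin, and the first and last bins are assumed to wrap around.

```python
import math


def orientation_channels(magnitude, angle, n_bins=6):
    """Spread one pixel's gradient magnitude over its two nearest orientation bins.

    angle: unsigned gradient orientation in radians, in [0, pi).
    Returns n_bins values that sum to the gradient magnitude.
    """
    width = math.pi / n_bins
    pos = angle / width - 0.5      # position measured in units of bin centers
    lower = math.floor(pos)        # index of the nearest bin center below
    frac = pos - lower             # linear weight of the upper bin
    channels = [0.0] * n_bins
    channels[lower % n_bins] += magnitude * (1.0 - frac)
    channels[(lower + 1) % n_bins] += magnitude * frac
    return channels
```

A pixel whose orientation coincides with a bin center contributes its full magnitude to that bin; a pixel exactly between two centers contributes half to each.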

Prior to channel computation, input images are smoothed
with a binomial filter [22] of radius 1, i.e. σ ≈ 0.87, in order
to remove noise. Note that we explicitly do not smooth channel
data as we observed this to lead to decreased performance.

C. Center-surround neighborhood patterns 

Here, we present details on our design of center-surround
cell pairs. Four patterns are proposed in this paper and explained
in the following.




Fig. 5: Illustration of two neighborhood patterns. (a) Sparse
pattern. Each red arrow points from the central cell to one of
its neighboring cells. (b) Shift pattern. Two layers of cells are
denoted with green and blue grid lines and the red cells denote
central cells with eight nearest neighboring cells.


$C_1S_8$ pattern: For each cell, its eight nearest neighboring
cells are considered as surrounding cells, denoted as
$[C_s^1, C_s^2, \ldots, C_s^8]$. The eight surrounding cells can be treated
either as a whole or separately. From our experiments, we
find a significantly better performance if they are treated
individually (cf. Fig. 10), since the difference information in each
of the eight directions is then preserved separately. Thus, we use this
$C_1S_8$ pattern in our experiments to build a multi-direction contrast
vector for each cell along every channel.
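
A minimal sketch of how such a multi-direction contrast vector could be assembled for one cell and one channel is given below. The names `NEIGHBOR_OFFSETS` and `contrast_vector`, the clockwise ordering of directions, and the use of `None` for neighbors outside the grid are assumptions made for illustration only.

```python
# Offsets (row, col) of the eight surrounding cells, clockwise from top-left.
# The ordering is an assumption; any fixed ordering would serve.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]


def contrast_vector(descriptors, r, c, measure):
    """Eight contrasts between cell (r, c) and each of its neighbours.

    descriptors: 2D list of per-cell descriptors for one channel.
    measure: function(center_descriptor, surround_descriptor) -> contrast.
    Neighbours outside the grid yield None (border handling is an assumption).
    """
    rows, cols = len(descriptors), len(descriptors[0])
    out = []
    for dr, dc in NEIGHBOR_OFFSETS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            out.append(measure(descriptors[r][c], descriptors[rr][cc]))
        else:
            out.append(None)
    return out
```

Any of the contrast measures of Section V can be passed as `measure`, so the same routine serves both Gaussian and histogram descriptors.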

Sparse pattern: Significant redundancy will emerge if we 
consider eight nearest neighboring cells for each cell, because 
each adjacent pair of cells is incorporated twice. To avoid this 
redundancy, we use a cell step of 2 cells along both horizontal 
and vertical directions, resulting in a sparse neighborhood map 
as shown in Fig. 5a. 

Shift pattern: Motivated by the Nyquist-Shannon sampling
theorem [23], we propose a shift mechanism where we define
two cell layers and iterate the $C_1S_8$ center-surround pattern
on each respectively. For the first layer, we start from the
top-left pixel and divide the whole image into square cells; for
the second layer, the starting point is shifted with a step of 0.5
times the cell size along both horizontal and vertical directions
and we then divide the image patch from the new starting point
into square cells with the same cell size. An illustration of the
shift mechanism is shown in Fig. 5b.
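
The two cell layers can be sketched as follows. The function names are hypothetical, and the sketch assumes even cell sizes and that cells which do not fit completely into the image are dropped.

```python
def cell_grid(width, height, cell, offset=0):
    """Top-left corners (x, y) of all full cells, starting at (offset, offset)."""
    return [(x, y)
            for y in range(offset, height - cell + 1, cell)
            for x in range(offset, width - cell + 1, cell)]


def shifted_layers(width, height, cell):
    """Regular cell layer and a second layer shifted by half the cell size."""
    return cell_grid(width, height, cell), cell_grid(width, height, cell, cell // 2)
```

For an 8x8 patch and a cell size of 4, the first layer holds four cells and the shifted layer holds the single cell that starts at pixel (2, 2).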

Multi-scale pattern: Finally, in order to describe contrasts at 
different scales, we use different cell sizes to build a contrast 
pyramid which is in accordance with the general architecture 
of most computational visual attention systems. 

IV. Statistical cell descriptors 

In order to assess the underlying distribution inside each
cell, we estimate both continuous and discrete statistical approximations:
(1) a Gaussian distribution, which is the type of
continuous distribution with maximum entropy given a known
mean and variance; (2) a bilinearly interpolated histogram, which
is a representation of frequencies, determined for discrete
intervals (bins).

In our following discussion, we assume that we have measured
values of channel $i$ for the whole input image. We denote
this data as channel image $P^i$ and consider a specific cell $c$
with its channel vector $P_c^i = [v_1^i, \ldots, v_n^i]$.


A. Gaussian distributions 

The true distribution of channel values for local image 
patches is unknown, but is modeled as Gaussian type in this 
section. This assumption is made not only because normality 
makes further estimations convenient to solve, but also due to 
its popularity in classic low-level vision models, for example, 
in [24]. 

For numerical description, we apply maximum likelihood 
(ML) estimation of the parameters and obtain mean and 
variance values as 


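$$ \mu_c^i = \frac{1}{n} \sum_{k=1}^{n} v_k^i = \overline{P_c^i} \tag{1} $$

and

$$ \Sigma_c^i = \frac{1}{n} \sum_{k=1}^{n} \left( v_k^i - \mu_c^i \right)^2 = \overline{(P_c^i)^2} - \left( \overline{P_c^i} \right)^2 \tag{2} $$

where $n$ denotes the number of channel values in cell $c$.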

Now the estimation is narrowed down to computing two
local averages: $\overline{P_c^i}$ and $\overline{(P_c^i)^2}$ according to Eq. 1 and Eq. 2. For
efficiency, we employ two integral images for each channel:
one for the original channel image $P^i$ and the other for
the squared channel image $(P^i)^2$ and thus avoid extensive
summations per individual cell.

Once the parameters of the Gaussian have been determined, 
we obtain a descriptor for cell c for channel i: 

$$ D^i(c) = [\mu_c^i, \Sigma_c^i]. \tag{3} $$
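
The following sketch shows how the mean and variance of a cell can be obtained from two integral images, one over the channel and one over the squared channel. Function names are hypothetical and plain Python lists are used for clarity; clamping the variance at zero is an added safeguard against rounding, not something stated above.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, size):
    """Sum over the size x size cell whose top-left pixel is (x, y)."""
    return ii[y + size][x + size] - ii[y][x + size] - ii[y + size][x] + ii[y][x]


def gaussian_descriptor(ii, ii_sq, x, y, size):
    """ML estimates [mean, variance] of one cell from two integral images."""
    n = size * size
    mean = box_sum(ii, x, y, size) / n
    mean_sq = box_sum(ii_sq, x, y, size) / n
    return [mean, max(mean_sq - mean * mean, 0.0)]  # clamp guards against rounding
```

Both integral images are built once per channel; every cell descriptor, at any cell size, then costs eight table lookups.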


B. Histograms 

Histograms are a reasonable discrete representation of distributions
without any prior assumption of the underlying statistics.
They count the observed frequencies of data appearing
in discrete intervals. The advantage of using a histogram
is furthermore that it tolerates noise and minor intra-class
differences and that its degree of tolerance can be adjusted
by choosing appropriate numbers of bins. Generally, using a
smaller number of bins results in a coarser description of the
original data, and vice versa.

It is computationally expensive to naively compute histograms
for all cells and all sizes, so we employ integral
histograms [25]. An integral histogram can be considered as a
stack of integral images each counting the sums of values to
the top and left from a pixel that fall into a certain histogram
bin.

To eliminate bias, we implement bilinear interpolation for
histograms, i.e. each value contributes to its two nearest bins
with a weight depending on the distance between the given value
and the bin center. Also, normalization is rather important for
histograms, since it eliminates the effect of data magnitudes.
In this paper, we normalize each local histogram for each cell
and channel so that it sums up to 1. In the end, given $b$ bins,
we obtain a histogram $H_c^i$ as a descriptor vector for channel
vector $P_c^i$:

$$ H_c^i = [h_c^i(1), h_c^i(2), \ldots, h_c^i(b)], \quad \sum_{k=1}^{b} h_c^i(k) = 1. \tag{4} $$
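
A small sketch of an interpolated, normalized histogram is given below. The function name `soft_histogram` and the value range arguments `lo` and `hi` are hypothetical, and values beyond the outermost bin centers are assumed to contribute fully to the outermost bin.

```python
def soft_histogram(values, n_bins, lo, hi):
    """Normalized histogram with linear interpolation between bin centers."""
    width = (hi - lo) / n_bins
    hist = [0.0] * n_bins
    for v in values:
        pos = (v - lo) / width - 0.5              # position in units of bin centers
        pos = min(max(pos, 0.0), n_bins - 1.0)    # assumption: clamp at outer centers
        k = min(int(pos), n_bins - 2) if n_bins > 1 else 0
        frac = pos - k                            # weight of the upper bin
        hist[k] += 1.0 - frac
        if n_bins > 1:
            hist[k + 1] += frac
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

A value that lies exactly between two bin centers contributes one half to each, so small shifts in channel values change the descriptor gradually instead of flipping a bin.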

V. Contrast measurements 

Aiming for the strongest center-surround contrast features,
we introduce multiple contrast measurements for each distribution
descriptor to make a comprehensive comparison in
this section. Combining a distribution descriptor and a corresponding
measurement forms a specific scheme for feature
extraction. We note that the cell descriptors introduced above
are statistical distributions whose comparison requires care.
Although the Euclidean distance is often used in practice,
it is not truly faithful to the nature of this kind of data. In
particular, when comparing distributions or histograms, we are
dealing with compositional data [26]. This is to say that, for a
normalized histogram $H = [h(1), \ldots, h(b)]$ of $b$ bins, there are
only $b - 1$ degrees of freedom, since the value of an arbitrary
bin $h(i)$ is determined by $1 - \sum_{j \neq i} h(j)$. It is therefore
impossible to perturb one bin of a histogram without affecting
the others. This has implications for similarity measurements
that are not accounted for by the Euclidean metric. However,
there are several distance or similarity measures that cope
with these characteristics and we consider their use in our
context. To summarize, we explore six different metrics in this
paper: Gaussian-$W_2$ distance, Gaussian-$L_2$ distance, Gaussian
gradient matrices, histogram Kullback-Leibler divergences,
histogram Hellinger distances, and histogram intersections.

In the following, we denote the channel distributions for
a central and a surrounding cell as $P_c^i$ and $P_s^i$, respectively.
The contrast vector $cst(P_c^i, P_s^i)$ is computed using different
measures.


A. Gaussian distributions 

We introduce three different contrast measures to compute 
the difference between two cells’ channel distributions, each 
represented by the Gaussian descriptor in Eq. 3. We compare 
the results of those three measures in Section VII. 

1) $W_2$ distance: The $W_2$ distance (2nd Wasserstein distance)
was first introduced as a measure for center-surround
contrast by Klein et al. [20] and achieved reasonable results
for saliency detection. Its definition in our case can be written
as:

$$ W_2(P_c^i, P_s^i) = \left( \inf_{\gamma \in \Gamma(P_c^i, P_s^i)} \int_{\mathbb{R} \times \mathbb{R}} |x - y|^2 \, d\gamma(x, y) \right)^{1/2} \tag{5} $$

where $\Gamma(P_c^i, P_s^i)$ denotes the set of all couplings of $P_c^i$ and
$P_s^i$.

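It would be intractable to compute the integral in Eq. 5
in case of arbitrary distributions. However, for the Gaussian
distribution, it can be solved analytically [27]. The contrast
vector between one central cell distribution $P_c^i \sim \mathcal{N}(\mu_c^i, \Sigma_c^i)$
and its neighboring cell distribution $P_s^i \sim \mathcal{N}(\mu_s^i, \Sigma_s^i)$ of
channel $i$ indeed amounts to:

$$ W_2(P_c^i, P_s^i) = \sqrt{ (\mu_c^i - \mu_s^i)^2 + \Sigma_c^i + \Sigma_s^i - 2 \sqrt{\Sigma_c^i \Sigma_s^i} } \tag{6} $$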


We note that Wasserstein distances are natural metrics for 
the comparison of two probability distributions where one 
distribution is derived from the other one through small, non- 
uniform perturbations; in the computer vision literature, the 
discretized Wasserstein distance is also referred to as the Earth
Mover’s distance [28].

2) $L_2$ distance: If we treat the two-dimensional descriptors
for the central and surrounding cells as two 2D points, then
the $L_2$ distance between $(\mu_c^i, \Sigma_c^i)$ and $(\mu_s^i, \Sigma_s^i)$ amounts to:

$$ d_{L_2}(P_c^i, P_s^i) = \sqrt{ (\mu_c^i - \mu_s^i)^2 + (\Sigma_c^i - \Sigma_s^i)^2 }. \tag{7} $$

3) Signed gradient matrix (SGrd): For each center-surround
cell pair, we compute the signed gradient matrix for
the mean and variance vector $[\mu, \Sigma]$, resulting in a contrast
vector. The contrast vector between one central cell distribution
$P_c^i \sim \mathcal{N}(\mu_c^i, \Sigma_c^i)$ and its neighboring cell distribution
$P_s^i \sim \mathcal{N}(\mu_s^i, \Sigma_s^i)$ of channel $i$ can then be expressed as follows:

$$ \mathrm{SGrd}(P_c^i, P_s^i) \tag{8} $$


In the feature space, the contrast vector in Eq. 8 is treated 
in terms of two separate values which enables a convenient 
training procedure. 
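
The three Gaussian contrast measures can be sketched in a few lines, with each cell descriptor given as a pair [mean, variance]. Function names are hypothetical. The $W_2$ function uses the well-known closed form of the 2nd Wasserstein distance between two one-dimensional Gaussians; for the signed gradient, the sign convention (center minus surround) is an assumption made for this illustration.

```python
import math


def w2_distance(center, surround):
    """2nd Wasserstein distance between two 1D Gaussians [mean, variance]."""
    (mu_c, var_c), (mu_s, var_s) = center, surround
    spread = var_c + var_s - 2.0 * math.sqrt(var_c * var_s)
    return math.sqrt((mu_c - mu_s) ** 2 + max(spread, 0.0))  # clamp rounding errors


def l2_distance(center, surround):
    """Euclidean distance between the two descriptors seen as 2D points."""
    (mu_c, var_c), (mu_s, var_s) = center, surround
    return math.hypot(mu_c - mu_s, var_c - var_s)


def signed_gradient(center, surround):
    """Two signed differences; 'center minus surround' is an assumed convention."""
    (mu_c, var_c), (mu_s, var_s) = center, surround
    return [mu_c - mu_s, var_c - var_s]
```

Note that the spread term in `w2_distance` equals the squared difference of the two standard deviations, whereas `l2_distance` compares the variances directly.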


B. Histograms 

We consider three different distance measures which are 
commonly used for histograms. In the following, the his¬ 
tograms for a central and a surrounding cell w.r.t. channel i 
are denoted as in Eq. 4. We compare the results of the three 
measurements in Section VII. 
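
Before going through the three measures one by one, a compact sketch of their discrete forms for two normalized histograms may be helpful. Function names are hypothetical. The small constant `eps` in the Kullback-Leibler divergence is an added assumption to avoid division by zero for empty bins, and the factor 1/2 in the squared Hellinger distance follows the common convention that bounds the value by 1.

```python
import math


def kl_divergence(h_c, h_s, eps=1e-12):
    """Discrete KLD of the center histogram from the surround histogram."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(h_c, h_s) if p > 0)


def hellinger_sq(h_c, h_s):
    """Squared Hellinger distance (with the conventional factor 1/2)."""
    return 0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2
                     for p, q in zip(h_c, h_s))


def intersection(h_c, h_s):
    """Histogram intersection of two histograms that each sum to 1."""
    return sum(min(p, q) for p, q in zip(h_c, h_s))
```

The first two functions are dissimilarities (0 for identical histograms), whereas the intersection is a similarity (1 for identical histograms).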

1) Kullback-Leibler divergence: Using information theoretic
arguments, one can represent the difference between a
center and a surround cell using the Kullback-Leibler Divergence
(KLD) [29],

$$ D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \ln \frac{p(x)}{q(x)} \, dx. \tag{9} $$

Thus, the KLD between two probability distributions P and
Q is a relative entropy that indicates the loss in information
if P is approximated by Q. The more P differs from Q, the
higher the KLD.

Given the histograms $H_c^i$ and $H_s^i$ of two channel vectors
$P_c^i$ and $P_s^i$, we calculate the discrete KLD as our first contrast
measure:

$$ D_{KL}(H_c^i \| H_s^i) = \sum_{k=1}^{b} \ln\left( \frac{h_c^i(k)}{h_s^i(k)} \right) h_c^i(k). \tag{10} $$

2) Hellinger distance: Let P and Q be two probability
distributions with respect to a probability measure $\lambda$; the
Hellinger distance is a measure of their difference that is
independent of $\lambda$. The square of the Hellinger distance has
a particularly simple form and is defined as [30]:

$$ H^2(P, Q) = \frac{1}{2} \int \left( \sqrt{\frac{dP}{d\lambda}} - \sqrt{\frac{dQ}{d\lambda}} \right)^2 d\lambda. \tag{11} $$

For two discrete probability distributions $H_c^i$ and $H_s^i$ that
represent $P_c^i$ and $P_s^i$, the Hellinger distance is then computed
as the contrast between $P_c^i$ and $P_s^i$:

$$ H^2(H_c^i, H_s^i) = \frac{1}{2} \sum_{k=1}^{b} \left( \sqrt{h_c^i(k)} - \sqrt{h_s^i(k)} \right)^2. \tag{12} $$

3) Histogram intersection: The histogram intersection is
another popular similarity measure for histograms. Given two
histograms $H_p$ and $H_q$ with $n$ bins, it is defined as:

$$ HI(H_p, H_q) = \frac{\sum_{k=1}^{n} \min\left( H_p(k), H_q(k) \right)}{\sum_{k=1}^{n} H_p(k)}. \tag{13} $$

As all histograms considered in this paper are normalized
so that they sum up to 1, the histogram intersection between
$H_c^i$ and $H_s^i$ can be further simplified to:

$$ HI(H_c^i, H_s^i) = \sum_{k=1}^{b} \min\left( h_c^i(k), h_s^i(k) \right). \tag{14} $$


Scales          4-6             4-6-8           4-6-8-10
Feature size    20,320 D(cst)   23,440 D(cst)   25,040 D(cst)

TABLE I: Illustration of feature size under different configurations.
All the contrast measurements used in this paper are
one dimensional, except SGrd, which is two dimensional.


VI. Classification


In this section, we discuss our approach towards classification
of the center-surround features introduced above. First of
all, we address the size of our feature pool. Given a pedestrian
model of 60 x 120 pixels, Tab. I compares feature sizes
under different settings in terms of scales and dimensions of
contrast vectors. As expected, the feature pool grows when more
scales are employed. Among all the contrast measurements
considered in this paper, only the signed gradient matrix is
two dimensional, while all others are one dimensional.

To efficiently train classifiers on such a large feature pool, 
we employ a fast version of AdaBoost [31] since it offers a 
convenient and fast approach to feature selection from a large 
number of candidate features. The feature selection procedure 
is conducted for each feature configuration individually. 
Fig. 6: Illustration of representative features under one configuration.
(a) Body parts weight map: different colors are used to
indicate the accumulative weight of each pixel after boosting.
(b) Channel weight bars: accumulative weight of each channel
is indicated by one bar.

For boosting algorithms, one should choose proper weak
classifiers so as to build the final strong classifier. We use
decision trees of depth 2 as our weak classifiers since they are
efficient to learn. Another important parameter is the number
of weak classifiers, which, after extensive experimentation, we
choose to be 4096, as we observe that more weak classifiers do
not lead to gains in performance. Similar to classic approaches
to pedestrian detection [9], [21], we also employ a multi-round
training strategy which has been shown to lead to better
performance than a simple one round training procedure with
the same number of samples. For the first round, the initial
negative training samples are randomly cropped from the negative
example images; in the following rounds, hard negative
samples are exhaustively searched over all negative example
images using the classifier built in the previous round. This
procedure is iterated until no significant performance gains are
observed with further retraining. From our experiments, three
rounds of retraining yield optimal performance; additional
rounds did not show significant improvements. We collect 5000
negative samples at each round, resulting in a large negative
sample pool of 20,000 image patches after four rounds (the
initial round plus three rounds of retraining).

In order to look into which local feature regions are more
informative, we plot a weight image of the top 100 feature
positions with highest weights from the final strong classifier,
as shown in Fig. 6a. To generate this map, we add the weight
of each selected feature to the cells it covers and use different
colors to indicate the accumulative weight of each cell after
boosting. As expected, the head-shoulder area of the human
body proves to be more discriminative for pedestrian detection
than other body parts. Moreover, we also add the weight of
each feature separated by channels to indicate which ones are
more representative and use bars to illustrate the accumulative
weight of each channel as shown in Fig. 6b. We find that all
the channels we chose contributed rather evenly to the final
classifier, indicating no channel redundancy.




                         INRIA [9]   Caltech [32]

Properties
  imaging setup          photo       mobile
  color images           yes         yes
  video seqs.            no          yes
  occlusion labels       no          yes

Training
  # pedestrians          1208        192k
  # pos. images          614         67k
  # neg. images          1218        61k

Testing
  # pedestrians          566         155k
  # pos. images          288         65k
  # neg. images          453         56k

TABLE II: Statistics of two pedestrian datasets used for
experiments [1].


The most discriminative features selected by the boosting 
algorithm are then used for pedestrian detection in still images. 
Our pedestrian model is of size 60 x 120 pixels and we resize 
the input image to detect pedestrians at different scales. To this 
end, we slide a window over the whole image and consider 
multiple scales. The spatial step size is set identical to the cell 
size for speed reasons and the scale step is set to be 1.09 so 
that there are 8 scales per octave. We use a simplified non- 
maximal suppression (NMS) procedure [21] to suppress nearby 
detections. 
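
The relation between the scale step and the number of scales per octave can be made concrete with a short sketch: eight scales per octave correspond to a step of 2^(1/8), which is approximately 1.09. The function names below are hypothetical, and stopping as soon as the resized image no longer contains the 60 x 120 model is an assumption.

```python
def detection_scales(img_w, img_h, model_w=60, model_h=120, scales_per_octave=8):
    """Downscaling factors at which the resized image still contains the model."""
    scales, k = [], 0
    while True:
        s = 2.0 ** (k / scales_per_octave)   # step of about 1.09 for 8 per octave
        if img_w / s < model_w or img_h / s < model_h:
            return scales
        scales.append(s)
        k += 1


def window_positions(img_w, img_h, stride, model_w=60, model_h=120):
    """Top-left corners of all sliding windows for a given spatial step."""
    return [(x, y)
            for y in range(0, img_h - model_h + 1, stride)
            for x in range(0, img_w - model_w + 1, stride)]
```

For a 120 x 240 image, this yields nine scales from 1.0 to 2.0, i.e. exactly one octave; setting `stride` to the cell size reproduces the spatial step described above.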


VII. Experiments 

In this section, we introduce the benchmark datasets and
evaluation protocols used in our experiments, provide comprehensive
comparisons for different feature schemes, and
compare our best detector configuration with state-of-the-art
detectors.

A. Benchmark datasets 

Experiments are conducted on two public benchmark 
datasets: the INRIA Person Dataset [9] and the Caltech 
Pedestrian Detection Benchmark [1]. A comparison of these two 
datasets is given in Tab. II. 

INRIA Person Dataset: This is arguably the most popular 
dataset for people detection and comes with predefined 
subsets for training and testing. The training set contains 
2416 positive samples, obtained from 1208 pedestrian images 
and their mirrored copies, and 1218 natural images without 
pedestrians, from which negative samples can be generated by 
randomly cropping subregions. The test set contains 288 
positive images with 566 pedestrian annotations. 

Caltech Pedestrian Detection Benchmark: This is currently 
the largest and most challenging dataset for pedestrian 
detection, consisting of approximately 10 hours of 640x480, 30 Hz 
video taken from a vehicle driving through regular traffic in 
an urban environment. About 250,000 frames with a total of 
350,000 bounding boxes and 2300 unique pedestrians were 
annotated. The training data (set00-set05) and the test data 
(set06-set10) consist of approximately 192,000 and 155,000 
pedestrian annotations, respectively. 























B. Evaluation protocol 

In the following, we explain the details of our evaluation 
protocol in four aspects, which are consistent with the conventions 
in this field [1]. 

1) Ground truth filtering: In our experiments, we consider a 
reasonable subset of the ground truth data, comprising 
pedestrians that are over 50 pixels in height and more than 65% 
visible. All other annotations are marked with an ignore label: 
they need not be matched, but detections matched to them are 
not counted as mistakes either. 

2) Detection results filtering: We filter detection results 
using the expanded filtering method [1], so that detections 
far outside the evaluation scale range are not considered. 
In this paper, we evaluate the scale range [50, +∞) pixels; 
accordingly, only detections with heights in [50/r, +∞) are 
considered for evaluation. In our experiments, we set r = 1.25 [1]. 
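
The two filtering steps can be sketched as follows. This is an illustration rather than the evaluation code of [1]: the function names are hypothetical, and treating the height and visibility bounds as inclusive is an assumption.

```python
def ground_truth_label(height_px, visible_fraction,
                       min_height=50.0, min_visibility=0.65):
    """Return 'evaluate' for reasonable annotations and 'ignore' otherwise.

    Ignored annotations need not be detected, and detections matched to
    them are not counted as false positives.
    """
    if height_px >= min_height and visible_fraction >= min_visibility:
        return "evaluate"
    return "ignore"


def keep_detection(height_px, min_height=50.0, expand=1.25):
    """Expanded filtering: keep detections at least min_height / expand tall."""
    return height_px >= min_height / expand
```

With the values given in the text, detections down to 50 / 1.25 = 40 pixels in height are retained, so that a slightly undersized detection of a 50-pixel pedestrian can still be matched.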

3) Bounding box matching rules: A filtered ground truth 
bounding box and a detection bounding box are denoted by 
$B_{gt}$ and $B_{dt}$, respectively. $B_{gt}$ and $B_{dt}$ match if and only if 
the ratio of the area of their intersection to the area of their 
union exceeds a given threshold [1]: 

$$\mathrm{match}(B_{dt}, B_{gt}) \iff \frac{\mathrm{area}(B_{dt} \cap B_{gt})}{\mathrm{area}(B_{dt} \cup B_{gt})} > 0.5 \qquad (15)$$
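
The overlap criterion can be sketched as follows. The box layout (x, y, width, height) and the function names are assumptions made for the example; the default threshold of 0.5 is the value commonly used with the protocol of [1].

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_match(detection, ground_truth, threshold=0.5):
    """A detection matches a ground truth box if the overlap exceeds the threshold."""
    return overlap_ratio(detection, ground_truth) > threshold
```

Two 10 x 10 boxes shifted by half their width overlap in 50 of 150 pixels of their union, giving a ratio of 1/3, which does not count as a match.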


4) Performance measurements: We perform evaluation w.r.t. 
full images instead of detection windows, as the former 
provides a natural measure of the error of an overall detection 
system. In this paper, we employ two measurements to 
compare performance among different detectors. First, we plot 
miss rate against false positives per image (FPPI) curves on 
logarithmic scales by varying the threshold on the detection 
confidence of the classifiers. In addition to these miss rate vs. 
FPPI curves, we calculate a single numerical measurement to 
summarize each detector's performance. We use the average 
miss rate [1], which is computed by averaging the miss rates 
at nine FPPI rates evenly sampled in log-space in the range 
[10^-2, 10^0]. This average miss rate generally gives a more 
stable and informative assessment of the overall performance 
of different detectors than the miss rate at 10^-1 FPPI alone [1]. 
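
The summary measure can be sketched as follows. This is a minimal sketch, not the evaluation code of [1]; the function names are hypothetical. Two choices are assumptions: the curve is read as a step function (the miss rate at a reference FPPI is that of the last curve point not exceeding it, and 1.0 if no such point exists), and the optional `geometric` flag covers the log-average variant that evaluation code for this protocol is often described as using.

```python
import math


def reference_fppi(count=9, low_exp=-2.0, high_exp=0.0):
    """Reference FPPI values evenly spaced in log-space over [10^-2, 10^0]."""
    step = (high_exp - low_exp) / (count - 1)
    return [10.0 ** (low_exp + i * step) for i in range(count)]


def miss_rate_at(fppi_curve, miss_curve, ref):
    """Miss rate of the last curve point with FPPI <= ref (curve sorted by FPPI)."""
    result = 1.0  # no operating point at or below ref: everything is missed
    for fppi, miss in zip(fppi_curve, miss_curve):
        if fppi > ref:
            break
        result = miss
    return result


def average_miss_rate(fppi_curve, miss_curve, geometric=False):
    """Average the miss rates sampled at the nine reference FPPI values."""
    samples = [miss_rate_at(fppi_curve, miss_curve, r) for r in reference_fppi()]
    if geometric:
        logs = [math.log(max(m, 1e-10)) for m in samples]
        return math.exp(sum(logs) / len(logs))
    return sum(samples) / len(samples)
```

Because the nine reference points cover two decades of FPPI, a detector has to perform well over the whole range to obtain a low score, which is what makes the measure more stable than a single operating point.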


C. Comparisons for different feature settings 

In this section, we seek the strongest feature scheme through 
experiments under different settings on the INRIA dataset. 
First, we define a default setting with three scales of 4x4, 
6x6, and 8x8 pixels, and 5 histogram bins where histograms 
are used. 

In the following, we compare different descriptors, contrast 
measurements, scale structures, and numbers of histogram bins 
where histograms are used. 

Fig. 7: Experiments on different contrast measures for two cell 
descriptors. (a) Gaussian distributions. 

1) Contrast measurements: We investigate different contrast 
measures for each of the two descriptors. From Fig. 7a, we 
see that both descriptors produce stable results across the 
different contrast measures. Despite this stable performance, we 
observe slight differences between the contrast measures. 
For Gaussian descriptors, the L2 distance performs worst, while 
the W2 distance and SGrd produce comparably better results. For 
histograms, HI is the worst measure, while KLD and Hellinger 
are comparably better. Therefore, we select Gaussian-W2 and 
Hist-Hellinger as the two preferable combinations, which 
produce the best results for Gaussian and histogram descriptors, 
respectively. 
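
For orientation, the two preferred measures can be sketched in their standard textbook forms. The definitions used in this paper are given in an earlier section and may differ in detail; the sketch assumes univariate Gaussians (one mean and one standard deviation per cell and channel) and histograms that are normalized before comparison. The function names are hypothetical.

```python
import math


def w2_gaussian(mean_a, std_a, mean_b, std_b):
    """2-Wasserstein distance between two univariate normal distributions."""
    return math.sqrt((mean_a - mean_b) ** 2 + (std_a - std_b) ** 2)


def hellinger(hist_a, hist_b):
    """Hellinger distance between two histograms (normalized internally)."""
    total_a, total_b = float(sum(hist_a)), float(sum(hist_b))
    # Bhattacharyya coefficient of the two normalized histograms.
    bc = sum(math.sqrt((a / total_a) * (b / total_b))
             for a, b in zip(hist_a, hist_b))
    return math.sqrt(max(0.0, 1.0 - bc))
```

Both measures are zero for identical cell distributions; the Hellinger distance is bounded by 1, which it reaches for histograms with disjoint support.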

2) Number of histogram bins: The number of bins is an 
important parameter in practical applications of histograms. Not 
surprisingly, we observe performance changes when increasing 
the number of histogram bins. Fig. 8 shows experimental 
results when using 5, 10, 15, and 20 bins with the Hellinger 
distance, which has been shown to be the best among all the 
contrast measures for histograms. Generally, more histogram 
bins capture more accurate information about the local cell 
region, thus leading to better performance. If we increase the 
number of histogram bins from 5 to 15, we obtain better results, 
as expected. However, performance begins to decrease again 
when we consider more than 20 bins, since these settings are 
more error-prone under noisy real-world data. In the 
following experiments, we thus use 15-bin histograms instead 
of the default 5-bin histograms. 








































































Fig. 8: Experiments with different histogram bins using the 
Hellinger distance. 

Fig. 9: Comparison of two optimal descriptor-measurement 
combinations and the baseline detector [ChnFtrs]. 

Fig. 10: Comparison of two center-surround patterns. 

Fig. 11: Comparison of three scale structures. In this 
experiment, 15-bin histograms are used. 


3) Descriptors: From Fig. 9, we can see that both optimal 
combinations outperform the baseline detector [ChnFtrs], 
which illustrates the effectiveness of our new features. 
Between the two optimal combinations, Gaussian-W2 produces 
better overall results than Hist(15 bins)-Hellinger. Therefore, 
Gaussian-W2 is selected as the optimal descriptor-contrast-measurement 
combination in this paper. 

4) C1S8 pattern vs. C1S1 pattern: We proposed the C1S8 
pattern in Section III in order to incorporate more information 
about local image differences. Here, we compare the 
performance of both patterns to show why the directed C1S8 pattern 
is superior. From Fig. 10, it appears that the C1S8 pattern 
produces better results than C1S1 over all 
descriptor-contrast-measurement combinations. 

5) Scale structures: Generally, the use of more scales 
incorporates richer information and leads to better performance. 
In this paper, we consider three different scale structures, 
4-6, 4-6-8, and 4-6-8-10, and compare them in 
Fig. 11. Increasing the scales from 4-6 to 4-6-8 brings about a 
significant improvement of approximately 5% w.r.t. miss rate; 
continuing to 4-6-8-10 produces a less prominent gain of less 
than 1% w.r.t. miss rate. Since it still performs best, we 
choose the scale structure 4-6-8-10 as our optimal choice. 

In summary, the optimal feature setting is the combination 
of Gaussian-W2 and the scale structure 4-6-8-10. We 
use this configuration in the following experiments. 

D. Computational complexity 

We investigate the computational complexity of different 
feature settings. Our normal-distribution-based and 
histogram-based features are both computed from local averages of 
certain values. Such local features can be computed in O(n) time, 
with n denoting the number of image pixels, using moving averages 
or integral image techniques. The settings differ only in the 
number of layers needed (one for each distribution parameter or 
bin), which amounts to a constant factor. The same holds for the 
distance functions implemented for the different feature 
settings: the cost is constant per pixel and thus grows linearly 
with image size, so the overall complexity of each setting is 
still O(n). Note that the constant factor for normal 
distributions is 2 per input channel, while histograms require 
b > 2 layers (e.g., b = 15), one per histogram bin. 
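
The linear-time argument can be illustrated with integral images. The following is a minimal sketch for a single channel, not the implementation used in our experiments: the function names are hypothetical, and the use of one layer for values and one for squared values (giving the factor of 2 for normal distributions) is an assumption about how the two parameters are obtained.

```python
def integral_image(img):
    """Summed-area table with a zero border; each pixel is visited once."""
    height, width = len(img), len(img[0])
    table = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum += img[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_sum(table, x, y, w, h):
    """Sum over the w x h box with top-left corner (x, y) using four lookups."""
    return table[y + h][x + w] - table[y][x + w] - table[y + h][x] + table[y][x]


def cell_mean_var(table, table_sq, x, y, size):
    """Maximum likelihood mean and variance of a square cell in O(1)."""
    n = float(size * size)
    mean = box_sum(table, x, y, size, size) / n
    var = box_sum(table_sq, x, y, size, size) / n - mean * mean
    return mean, max(0.0, var)
```

Building the two tables costs O(n) per channel; afterwards the parameters of any cell are obtained with eight lookups, independent of the cell size. A b-bin histogram descriptor would use b such tables, one per bin indicator layer.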





















































































                             Average miss rate
Detector                     INRIA        Caltech

VJ [33]                      72.48%       94.73%
HOG [ ]                      45.98%       68.46%
Shapelet [34]                81.70%       91.37%
MultiFtr [35]                36.50%       68.62%
MultiFtr+CSS [ 6]            24.74%       60.89%
MultiFtr+Motion [16]         /            50.88%
HikSvm [36]                  42.82%       73.39%
HogLbp [12]                  39.10%       67.77%
LatSvm-V1 [3 ]               43.83%       79.78%
LatSvm-V2 [38]               19.96%       63.26%
ChnFtrs [2 ]                 22.18%       56.34%
FeatSynth [39]               30.88%       60.16%
MultiResC [40]               /            48.45%
CrossTalk [41]               18.98%       53.88%
VeryFast [42]                15.96%       /
SketchTokens [43]            13.32%*      /
Roerei [44]                  13.53%*      48.35%
RandForest [ 5]              15.37%*      51.17%
AFS+Geo [46]                 /            66.76%
MT-DPM+Context [47]          /            37.64%*
DBN-Isol [48]                /            53.14%
DBN-Mut [49]                 /            48.22%
ACF+SDt [50]                 /            37.34%*
ours                         15.90%       34.96%*

TABLE III: Performance comparisons to state-of-the-art 
pedestrian detectors. Each row in this table displays the 
corresponding performance in terms of average miss rate. The 
approach proposed in this paper yields state-of-the-art 
performance on the INRIA dataset and consistently better results 
than previously reported methods on the Caltech dataset. We 
indicate the top three detectors for each dataset with *. 


The computational complexity of our baseline detector 
[ChnFtrs] is also O(n), because each pixel is visited once 
per channel for computing local sums. Therefore, the 
computational complexity of our features at different settings is on 
par with [ChnFtrs]. 

E. Comparisons with state-of-the-art detectors 

In this section, we compare the performance of our detector, 
using the optimal settings found in Section VII-C, to 
state-of-the-art detectors whose results are publicly available¹, 
following the experimental protocol explained in Section VII-B. 

The results on the INRIA dataset in Fig. 13a show that 
our detector outperforms the baseline detector [ChnFtrs] by 
about 6% and reaches state-of-the-art performance. On the 
Caltech pedestrian dataset, our detector not only outperforms 
the baseline detector [ChnFtrs] by about 15% but also yields 
the overall best performance, as shown in Fig. 13c. More 
extensive comparisons are shown in Tab. III. 

¹ http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/ 

Moreover, we show results under different occlusion 
conditions on the Caltech dataset in Fig. 13. Our 
detector ranks consistently high across different occlusion 
levels and outperforms some part-based approaches, e.g., MT-DPM. 
These evaluations indicate that our approach is robust to 
occlusions. The major reason is that the most informative features 
are automatically selected from the head-shoulder area (as 
shown in Fig. 6a), where occlusion is less likely to happen. 

VIII. Conclusion 

Humans are able to efficiently locate what they are looking 
for because the human visual system is tuned to characteristic 
visual features so that objects of interest become salient. This 
mechanism is called top-down saliency or visual search. In 
this paper, we tried to mimic early human visual processing 
by using local distribution contrast features and boosting them 
to respond to the appearance of pedestrians. The resulting 
pedestrian detector thus realizes a computational top-down 
saliency system. Our features are very efficient to compute by 
combining a fast integral method for local averaging with 
a clever arrangement of additional image layers for fast 
maximum likelihood estimation of the parameters of normal 
distributions. We tested different patterns for organizing the 
center-surround structure and scale structure, as well as different 
ways to estimate the cell distribution and contrast measurements. 

Experimental results showed that our detector achieves 
state-of-the-art performance on the INRIA pedestrian dataset. 
Moreover, on the Caltech pedestrian dataset, we found it to 
outperform all other recent approaches considered. 

Given these results, it appears promising to further explore 
feature design driven by human visual mechanisms. Immediate 
extensions of the approach presented in this paper consist in 
incorporating information from additional channels such as 
motion and depth. This is currently explored in ongoing work 
and results will be reported once they become available. 